A System to Mine Large-Scale Bilingual Dictionaries from Monolingual Web Pages

Authors

  • Guihong Cao
  • Jianfeng Gao
  • Jian-Yun Nie
Abstract

This paper describes a system that automatically mines English-Chinese translation pairs from large amounts of monolingual Chinese web pages. Our approach is motivated by the observation that many Chinese terms (e.g., named entities that are not stored in a conventional dictionary) are accompanied by their English translations in Chinese web pages. In our approach, candidate translations are extracted using pre-defined templates. Transliterations and translation pairs are then identified using statistical learning methods. We compare several approaches to aligning transliterations and mining translations on more than 300GB of Chinese web pages. In our experiments on an MSN query log, we show that the mined bilingual dictionary greatly enlarges the coverage of an existing English-Chinese dictionary. It also improves query translation in cross-language information retrieval, leading to significantly higher retrieval effectiveness on TREC collections.

Similar resources

Modeling Monolingual And Bilingual Collocation Dictionaries In Description Logics

This paper discusses an approach to modeling monolingual and bilingual dictionaries in the description logic species of the OWL Web Ontology Language (OWL DL). The central idea is that the model of a bilingual dictionary is a combination of the models of two monolingual dictionaries, in addition to an abstract translation model. The paper addresses the advantages of using OWL DL for the design ...

On multiword lexical units and their role in maritime dictionaries

Multi-word lexical units are a typical feature of specialized dictionaries, in particular monolingual and bilingual maritime dictionaries. The paper studies the concept of the multi-word lexical unit and considers the similarities and differences of their selection and presentation in monolingual and bilingual maritime dictionaries. The work analyses such issues as the classification of multi-w...

Word etymology in monolingual and bilingual dictionaries: lexicographers' versus EFL learners' perspectives

This paper deals with the treatment of word etymology in monolingual and bilingual dictionaries. It also investigates EFL learners' attitudes towards the importance of etymology for understanding the meaning of the words they look up in dictionaries. The data were collected through tasks of looking up Arabic loan words in English in monolingual and bilingual dictionaries. The results indicate t...

Machine Translation Detection from Monolingual Web-Text

We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that requi...

MT Express

The machine translation systems that are being developed at CRL are designed for assimilation purposes and are targeted at a large variety of source texts, including news articles, Web pages, newsgroups articles and email traffic. Thus, coverage and robustness are emphasized over depth of analysis, and accuracy over stylistic fluidity. Moreover, these systems are for the most part developed und...

Publication date: 2007